ScreenSpot-Pro

A GUI grounding benchmark for professional high-resolution computer use — testing whether AI can locate tiny UI elements across 23 applications and 5 industries

Published

September 11, 2025

Keywords: ScreenSpot-Pro, GUI grounding, GUI agent, screen understanding, UI element localization, high-resolution, professional software, multimodal LLM, computer use, visual grounding, Photoshop, AutoCAD, VSCode, MLLM benchmark, ScreenSeekeR

Introduction

GUI agents — AI systems that can operate computer interfaces on behalf of users — represent one of the most ambitious frontiers in AI. But while models have made progress on simple tasks like web browsing and mobile navigation, they collapse on professional software. The dense toolbars, tiny icons, and high-resolution multi-panel layouts of applications like Photoshop, AutoCAD, MATLAB, and Visual Studio Code remain far beyond their reach.

ScreenSpot-Pro quantifies this gap. It is a GUI grounding benchmark featuring 1,581 expert-annotated tasks across 23 professional applications, 5 industries, and 3 operating systems — all captured at authentic high resolutions. The challenge: given a natural language instruction and a full-screen screenshot, locate the exact UI element to click. Targets occupy only 0.07% of the screen area on average — 29× smaller than the original ScreenSpot benchmark.

“Existing GUI grounding models perform poorly on this dataset, with the best model achieving only 18.9%.” — ScreenSpot-Pro Paper

graph LR
    A["ScreenSpot<br/>Cropped screenshots<br/>Target: 2.01% of image"] --> B["Too easy<br/>for frontier models"]
    B --> C["ScreenSpot-Pro<br/>Full-screen, high-res<br/>Target: 0.07% of image"]
    C --> D["Tests real-world<br/>professional GUI<br/>grounding"]

    style A fill:#e74c3c,stroke:#333,color:#fff
    style B fill:#f39c12,stroke:#333,color:#fff
    style C fill:#27ae60,stroke:#333,color:#fff
    style D fill:#3498db,stroke:#333,color:#fff

What Is ScreenSpot-Pro?

ScreenSpot-Pro is a benchmark that evaluates whether multimodal large language models (MLLMs) can ground natural language instructions to precise UI element locations in high-resolution professional screenshots. Unlike prior benchmarks that used cropped or simplified screenshots, ScreenSpot-Pro uses full, unmodified screen captures from real expert workflows.

Key Characteristics

Feature Details
Total tasks 1,581 instructions (each in a unique screenshot)
Applications 23 across 5 professional industries + OS commons
Operating systems Windows, macOS, Linux
Resolution 1080p (1920×1080) and above, including dual-monitor setups
Target size 0.07% of image area on average (29× smaller than ScreenSpot)
Element types Text (62.6%) and Icons (37.4%)
Annotation Expert users with 5+ years experience; dual-reviewer quality control
Multilingual English + Chinese instructions for all tasks
License CC BY 4.0

Applications and Industries

ScreenSpot-Pro covers a uniquely diverse range of professional software:

graph TD
    SSP["ScreenSpot-Pro<br/>23 Applications"] --> DEV["Development<br/>& Programming"]
    SSP --> CRE["Creative<br/>Software"]
    SSP --> CAD["CAD &<br/>Engineering"]
    SSP --> SCI["Scientific &<br/>Analytical"]
    SSP --> OFF["Office<br/>Suite"]
    SSP --> OS["Operating System<br/>Commons"]

    DEV --> D1["VSCode · PyCharm<br/>Android Studio<br/>Quartus · VMware"]
    CRE --> C1["Photoshop · Premiere<br/>Illustrator · Blender<br/>FruitLoops · Unreal Engine<br/>DaVinci Resolve"]
    CAD --> CA1["AutoCAD · SolidWorks<br/>Inventor · Vivado"]
    SCI --> S1["MATLAB · Origin<br/>Stata · EViews"]
    OFF --> O1["Word · PowerPoint<br/>Excel"]
    OS --> OS1["Windows 11<br/>macOS · Linux"]

    style SSP fill:#e74c3c,color:#fff,stroke:#333
    style DEV fill:#3498db,color:#fff,stroke:#333
    style CRE fill:#27ae60,color:#fff,stroke:#333
    style CAD fill:#f39c12,color:#fff,stroke:#333
    style SCI fill:#8e44ad,color:#fff,stroke:#333
    style OFF fill:#e67e22,color:#fff,stroke:#333
    style OS fill:#6cc3d5,color:#fff,stroke:#333

What Makes It So Hard?

The core difficulty comes from three compounding factors:

  1. Professional complexity — applications like AutoCAD and MATLAB have hundreds of densely packed buttons, menus, and panels
  2. High resolution, tiny targets — at full-screen resolution, the target element averages only 0.07% of the image area
  3. Specialized icons — professional tools use domain-specific icons that are rarely seen in web training data
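To build intuition for how small these targets are, here is a quick back-of-envelope calculation, assuming a single 1920×1080 frame (the benchmark's minimum resolution):

```python
# How big is 0.07% of a 1920x1080 screenshot?
screen_px = 1920 * 1080             # 2,073,600 pixels in the frame
target_px = screen_px * 0.0007      # 0.07% of the frame
side = target_px ** 0.5             # side length if the target were square
print(round(target_px), "px^2, roughly a", round(side), "px square")
```

Roughly a 38-pixel square on a full-HD frame, and an even smaller relative share on the dual-monitor captures.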

In the original paper, GPT-4o scored only 0.8% on direct grounding — barely above random chance. Even the best specialist model (OS-Atlas-7B) achieved just 18.9%.

Who Built It?

ScreenSpot-Pro was developed by researchers at the National University of Singapore (NUS), East China Normal University, and Hong Kong Baptist University:

  • Kaixin Li, Zhiyong Huang, Tat-Seng Chua — National University of Singapore
  • Ziyang Meng — East China Normal University
  • Hongzhan Lin, Ziyang Luo, Yuchen Tian, Jing Ma — Hong Kong Baptist University

The benchmark was published at the Workshop on Reasoning and Planning for Large Language Models (2025).

Resource Link
arXiv paper arxiv.org/abs/2504.07981
Leaderboard gui-agent.github.io/grounding-leaderboard
GitHub github.com/likaixin2000/ScreenSpot-Pro-GUI-Grounding

What Skills Does It Test?

ScreenSpot-Pro evaluates a very specific but critical capability: GUI visual grounding — the ability to translate a natural language instruction into a precise screen coordinate.

Capability What It Tests
High-resolution perception Processing screenshots at >1080p without losing detail
Tiny element localization Finding targets occupying 0.07% of the screen area
Professional domain knowledge Understanding industry-specific UI patterns (toolbars, panels, menus)
Icon comprehension Recognizing specialized icons (e.g., blend modes in Photoshop, circuit symbols in Vivado)
Cross-platform understanding Working across Windows, macOS, and Linux interfaces
Bilingual instruction following Grounding from both English and Chinese instructions

Example Tasks

Tasks range from straightforward to highly specialized:

  • “Refresh the file explorer” — VSCode (icon target)
  • “Unlink audio and video” — Premiere (text target in a context menu)
  • “Change the coordinate mode of the object” — Blender (icon target in a dense toolbar)
  • “Select the SM1.smf file in Quartus window” — Quartus (text target in a file browser)
  • “Disable masking” — Origin (tiny icon in a crowded toolbar)

Current Leaderboard

The leaderboard below shows model accuracy on ScreenSpot-Pro. The metric is click accuracy: whether the model’s predicted click point falls within the annotated ground-truth bounding box.

Source: ScreenSpot-Pro Leaderboard (consulted March 29, 2026). Last updated November 17, 2025. Results collected using greedy decoding; micro-average numbers reported.

Top 20 Models

Rank Model Overall (%)
1 KV-Ground-GuiOwl1.5-0315-8B-ZoomIn 80.5
2 Holo2-235B-A22B (Agentic) 78.5
3 MAI-UI-32B (MVP) 77.5
4 KV-Ground-GuiOwl1.5-4B-0228-ZoomIn 76.4
5 Holo2-30B-A3B (Agentic) 75.2
6 MVP_Qwen3VL-32B 74.1
7 MAI-UI-32B (Zoom In) 73.5
8 KV-Ground-GuiOwl1.5-0315-8B 73.2
9 MAI-UI-8B (Zoom In) 71.9
10 Holo2-8B (Agentic) 71.4
11 AdaZoom-GUI-Refine 71.3
12 Holo2-235B-A22B 70.6
13 KV-Ground-Qwen3VL-4B-ZoomIn 70.3
14 UI-Venus-1-5-30B-A3B 69.6
15 Holo2-4B (Agentic) 68.6
16 UI-Venus-1-5-8B 68.4
17 MAI-UI-32B 67.9
18 KV-Ground-GuiOwl1.5-0228-4B 67.0
19 Holo2-30B-A3B 66.1
20 MAI-UI-8B 65.7

Notable General-Purpose Models

Rank Model Overall (%)
41 Qwen2.5-VL-72B-Instruct 53.3
49 Qwen2.5-VL-32B-Instruct 48.0
56 UI-TARS-72B 38.1
70 GPT5-minimal (resized) 18.5
71 Claude (Computer Use) 17.1
83 GPT-4o 0.8

Key takeaways:

  • The best model (KV-Ground-GuiOwl1.5-8B with ZoomIn) achieves 80.5% — a massive leap from the original paper’s best of 18.9%, driven by visual search strategies that narrow the search area
  • Agentic / multi-round methods dominate the top ranks — models that zoom into candidate regions outperform single-pass approaches
  • General-purpose VLMs (GPT-4o, Claude Computer Use) still struggle severely on direct grounding in professional high-res environments
  • Even GPT-5 in minimal mode reaches only 18.5% when images are simply resized

Where to Explore the Benchmark

Dashboards and Resources

Resource Description Link
Official Leaderboard Live leaderboard with per-application breakdown across 23 software gui-agent.github.io/grounding-leaderboard
GitHub Repository Evaluation code, configs, and inference scripts github.com/likaixin2000/ScreenSpot-Pro-GUI-Grounding
Hugging Face Dataset The 1,581-task dataset with screenshots and annotations huggingface.co/datasets/likaixin/ScreenSpot-Pro
arXiv Paper Full technical paper with methodology and analysis arxiv.org/abs/2504.07981

Load the Dataset

from datasets import load_dataset

dataset = load_dataset("likaixin/ScreenSpot-Pro")
print(f"Number of tasks: {len(dataset['test'])}")
# Number of tasks: 1581
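Once loaded, you can recompute per-task statistics such as the relative target area. The field names below (`bbox` as `[left, top, right, bottom]` and `img_size` as `[width, height]`) are assumptions for illustration; check the dataset card on Hugging Face for the actual schema.

```python
# Sketch: a target's relative screen area from one annotation.
# Field layout is an assumed convention, not the confirmed schema.

def relative_area(bbox, img_size):
    """Fraction of the screenshot covered by the target box."""
    left, top, right, bottom = bbox
    width, height = img_size
    return (right - left) * (bottom - top) / (width * height)

# Example with made-up numbers: a 48x30 px target on a 1920x1080 screen.
print(f"{relative_area([100, 200, 148, 230], [1920, 1080]):.4%}")
```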

Understanding the Metrics

Click Accuracy

The primary metric is straightforward: given a model’s predicted click point (x, y), does it fall inside the annotated ground-truth bounding box? For models that output bounding boxes instead of points, the center of the predicted box is used.
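The metric above takes only a few lines to implement. The `(left, top, right, bottom)` box layout here is an assumed convention for the sketch, not the benchmark's confirmed format:

```python
def click_hit(pred_point, gt_bbox):
    """Click accuracy check: does the predicted (x, y) point fall
    inside the ground-truth box?"""
    x, y = pred_point
    left, top, right, bottom = gt_bbox
    return left <= x <= right and top <= y <= bottom

def box_center(pred_bbox):
    """For models that output a box instead of a point, score its center."""
    left, top, right, bottom = pred_bbox
    return ((left + right) / 2, (top + bottom) / 2)

print(click_hit((120, 45), (100, 30, 160, 60)))                    # True
print(click_hit(box_center((90, 20, 150, 70)), (100, 30, 160, 60)))  # True
```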

Per-Category Breakdown

The leaderboard reports accuracy per application, which reveals where models excel vs. struggle:

Category Challenge Level Why
Office Suite Moderate Familiar UI patterns, used in web training data
OS Commons Moderate Standard system interfaces
Development Hard Dense code editors, many small icons
Creative Very Hard Custom UIs, non-standard toolbars
CAD & Engineering Very Hard Extremely dense, specialized icons
Scientific Very Hard Domain-specific plots, menus with many entries

Text vs. Icon Targets

Icons are consistently harder to ground than text elements — models can leverage OCR capabilities for text but must rely on visual understanding for icons. In the original paper, OS-Atlas-7B scored 28.1% on text but only 4.0% on icons.

ScreenSeekeR: The Breakthrough Approach

The paper introduced ScreenSeekeR, an agentic visual search framework that dramatically improves grounding accuracy by narrowing the search area rather than trying to locate elements in the full high-resolution image. This insight — that reducing the search space matters more than increasing model size — proved foundational for the leaderboard leaders.

graph TD
    A["Full Screenshot<br/>High resolution"] --> B["Planner (GPT-4o)<br/>Predicts candidate regions"]
    B --> C["Score & Filter<br/>Candidate areas"]
    C --> D["Crop & Zoom<br/>Into top candidates"]
    D --> E["Grounder Model<br/>Locates target in<br/>simplified sub-image"]
    E --> F["Verify Result<br/>Planner checks<br/>correctness"]

    style A fill:#ecf0f1,color:#333,stroke:#bdc3c7
    style B fill:#3498db,color:#fff,stroke:#333
    style C fill:#f39c12,color:#fff,stroke:#333
    style D fill:#e67e22,color:#fff,stroke:#333
    style E fill:#27ae60,color:#fff,stroke:#333
    style F fill:#8e44ad,color:#fff,stroke:#333

ScreenSeekeR boosted OS-Atlas-7B from 18.9% to 48.1%, a 2.5× improvement, without any additional training. This cascaded zoom-and-search approach inspired many of the top leaderboard methods (ZoomIn, MVP, Agentic variants).
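The pipeline above reduces to a coarse-to-fine search loop. In this minimal sketch the planner and grounder are deterministic stubs with hypothetical names; in the real framework both are MLLM calls (e.g. GPT-4o as the planner):

```python
# A minimal sketch of a cascaded zoom-and-search loop in the spirit of
# ScreenSeekeR. Function names and region logic are illustrative stubs,
# not the paper's API.

def plan_candidate_regions(image_size, instruction):
    """Stub planner: propose (x, y, w, h) regions likely to contain
    the target. Assumption: just split the screen into quadrants."""
    w, h = image_size
    return [(0, 0, w // 2, h // 2), (w // 2, 0, w // 2, h // 2),
            (0, h // 2, w // 2, h // 2), (w // 2, h // 2, w // 2, h // 2)]

def ground_in_region(region, instruction):
    """Stub grounder: return a click point in full-image coordinates,
    or None if the target is not found in this crop."""
    x, y, w, h = region
    if (x, y) == (0, 0):          # pretend the target sits top-left
        return (x + w // 4, y + h // 4)
    return None

def screen_seek(image_size, instruction):
    """Search candidate regions coarse-to-fine; return the first hit."""
    for region in plan_candidate_regions(image_size, instruction):
        point = ground_in_region(region, instruction)
        if point is not None:     # a verifier step would re-check here
            return point
    return None

print(screen_seek((1920, 1080), "Refresh the file explorer"))
```

The key design point is that the grounder only ever sees a cropped sub-image, so the tiny target occupies a much larger fraction of its input.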

Why ScreenSpot-Pro Matters

graph LR
    A["GUI agents need<br/>professional software<br/>capabilities"] --> B["Existing benchmarks<br/>too simple"]
    B --> C["ScreenSpot-Pro<br/>fills the gap"]
    C --> D["Better GUI agents<br/>for real productivity"]

    A2["High-res screens<br/>tiny UI targets"] --> B2["Models fail at<br/>precise localization"]
    B2 --> C
    C --> D2["Focus on<br/>visual search<br/>strategies"]

    style A fill:#e74c3c,color:#fff,stroke:#333
    style A2 fill:#e74c3c,color:#fff,stroke:#333
    style C fill:#27ae60,color:#fff,stroke:#333
    style D fill:#3498db,color:#fff,stroke:#333
    style D2 fill:#3498db,color:#fff,stroke:#333

  1. Tests what matters for real productivity — Professional software is where GUI agents could deliver the most value, yet it’s the hardest environment
  2. Exposes the resolution bottleneck — Models that work on cropped screenshots fail catastrophically at full-screen resolution
  3. Validates visual search — The massive gap between single-pass (18.9%) and agentic zoom approaches (80.5%) proves that search strategy is critical
  4. Diverse and authentic — 23 applications across 5 industries, annotated by domain experts during real workflows
  5. Active community — 84 model submissions on the leaderboard and growing


Conclusion

ScreenSpot-Pro reveals a critical truth about AI GUI agents:

  • 1,581 expert-annotated tasks across 23 professional applications — from Photoshop and AutoCAD to MATLAB and Blender
  • Targets occupy only 0.07% of the screen — 29× smaller than the original ScreenSpot benchmark
  • General-purpose models like GPT-4o score < 1% on direct grounding in professional environments
  • Visual search strategies (zoom-and-crop) are the key breakthrough, with the best agentic methods reaching 80.5%
  • The gap between single-pass (18.9%) and multi-round approaches (80.5%) proves that the problem is not just about better models but about smarter search

As GUI agents evolve from web browsing toys into serious productivity tools, ScreenSpot-Pro provides the benchmark that measures whether they can handle the software that professionals actually use.
